Spark API Documents

I constantly reference the PySpark API documentation and keep it open in a separate browser window at work. The majority of your Spark application will be written with the functions found in that documentation.

It can be found with this link (I suggest you bookmark it 😀):

PySpark latest API docs

Ask Google

When in doubt, ask Google. There are a lot of crowdsourced questions and answers on Stack Overflow.

Companies that Contribute to Spark

Databricks and Cloudera contribute heavily to Spark, and both publish a lot of good blog posts about writing performant Spark code.

Note: Matei Zaharia, the original author of Spark, also cofounded Databricks.

Conference Talks

There are a lot of Spark conferences throughout the year where speakers from the companies above or from the big tech companies come to speak about their advances and experiences with Spark at scale. I find these talks very insightful for writing "real big data" applications. They also cover broader subject matter, such as how to manage a large Spark cluster.

Example: Apache Spark - Spark + AI Summit San Francisco 2018

Spark Books

The O'Reilly books on Spark are how I got into Spark. They are written either by high-profile people in the Spark community (Holden Karau) or by the original members who created Spark back in the AMPLab days (Matei Zaharia).

The two that I would recommend are:

  • Learning Spark: Lightning-Fast Big Data Analysis

    • Back in the early days of Spark, this was the only book out there. I started off with this book. It gives a nice overview of everything in Spark.
    • It might be a bit outdated, but nonetheless it will give you an appreciation for how far Spark has come.
    • This is written by Holden Karau and Matei Zaharia, most notably.
  • Spark: The Definitive Guide: Big Data Processing Made Simple

    • This book is more up to date, as it covers Spark SQL and the DataFrames API in more depth.
    • This book is written by Matei Zaharia, most notably.

Spark Release Docs

Spark is an open-source Apache project that releases new features regularly. If you want to stay up to date with the newest features, I recommend following its releases and news:

Spark Releases
